Abstract

The traditional approach to health risk modelling with multiple
data sources proceeds via regression-based methods that assume a marginal
distribution for the outcome variable. The data are collected for N
subjects over J time points or from J data sources. The response
obtained from subject i is $Y_i = (Y_{i1}, \dots, Y_{iJ})$. For N subjects we
obtain a J dimensional joint distribution for the subjects. In this work we
propose a novel approach that transforms any J dimensional joint
distribution to a J dimensional Gaussian while keeping the Shannon
entropy constant. This is in stark contrast to the traditional approach
of assuming a marginal distribution for each $Y_{ij}$ and treating the
$Y_{ij}$'s as independent observations. The transformation is implemented
in our computer package ENTRA.


1 Introduction 

Information about health outcomes in many epidemiological studies is
obtained from multiple data sources or from multiple observations over a
certain time period. The multiple data sources provide multiple measures of
the same underlying variable, measured on a similar scale. As an example,
N adults are chosen for a high blood pressure study at the age of 18. They
are asked about their diet, smoking and drinking habits, and their blood
pressure is measured. Over the course of time the same subjects are
monitored again on the same set of variables. This illustrates the
standard data collection exercise for measuring health risk. Once such data
are available we can begin the risk modelling to ascertain which factors
contribute towards high blood pressure and which contribute towards
low blood pressure.

To put it mathematically, given N subjects and J data sources or time
points, the data are collected as a p dimensional vector of covariates,
$X_{ij} = (X_{ij1}, \dots, X_{ijp})$, where $1 \le i \le N$ and $1 \le j \le J$.
Given such a vector, the outcome is reported as a variable $Y_{ij}$ for the
i-th subject and the j-th data source. We can then construct a J dimensional
vector $Y_i$ as $Y_i = (Y_{i1}, \dots, Y_{iJ})$. For a given value of i this
vector is a point in a J dimensional space. Since $1 \le i \le N$, the total
number of points in the J dimensional space equals N. These N points follow
a certain distribution which is a priori unknown. In the conventional
analysis the outcome variables $Y_{ij}$ are treated as independent and
nothing is assumed about the joint distribution in the J dimensional space.
The independence assumption is not correct, but as we will see in Section 2
this does not affect the statistical analysis that we intend to carry out.

We propose a novel analysis tool for health risk modelling that
transforms the a priori unknown J dimensional density to a Gaussian
density whilst keeping the Shannon entropy constant. To do so we
transform the N vectors of dimension J in a basis set consisting
of divergence-free vector fields [2]. Restricting the basis set to
divergence-free vector fields enforces the entropy conserving
condition: entropy is conserved by volume preserving maps, and our choice
of basis set represents the flow of an incompressible fluid and hence is a
volume preserving map.

To determine the coefficients of these basis vectors such that the
J dimensional density is a Gaussian, the Karplus theorem is used [4], [5].
We will demonstrate in Section 4 how this theorem allows the basis
coefficients to be determined in such a way that the J dimensional density
is transformed to a Gaussian.

The paper is organized as follows. In Section 2 the traditional approach
to modelling health risk is reviewed and a novel approach is proposed. In
Section 3 the construction of the basis set consisting of high dimensional
vector fields is shown. In Section 4 the question of determining the
coefficients of the basis set is settled. The algorithm and program
structure are discussed in Section 5. In Section 6 a test case is computed
to see how the transformation works in practice. Finally, in Section 7 we
summarize the work done so far and make proposals for future work.


2 Regression methods for multiple source outcomes 

Consider a trial to model high blood pressure with N subjects monitored
over J time points. The response obtained from the i-th subject can be
written as a J dimensional vector

$$Y_i = (Y_{i1}, \dots, Y_{iJ}). \tag{1}$$

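Traditionally a joint distribution for the N subjects is not specified.
Instead, a working generalized linear model (GLM) is used to describe the
marginal distribution of $Y_{ij}$, as in Liang and Zeger (1986) [6],

$$f(Y_{ij}) = \exp\left[\frac{Y_{ij}\theta_{ij} - a(\theta_{ij})}{\phi} + b(Y_{ij}, \phi)\right]. \tag{2}$$

If the outcome $Y_{ij}$ is a binary random variable then the parameters of
the above exponential family are

$$\phi = 1, \quad a(\theta_{ij}) = \log\left[1 + e^{\theta_{ij}}\right], \quad \theta_{ij} = \log\left[\frac{\mu_{ij}}{1 - \mu_{ij}}\right], \quad b(Y_{ij}, \phi) = 0, \tag{3}$$

where $\mu_{ij}$ denotes the mean of $Y_{ij}$.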


The probability of a favourable outcome, if $Y_{ij}$ is a binary random
variable, can be modelled via a logit function as

$$\mathrm{Logit}(P[Y_{ij} = 0 \mid X_{ij}]) = X_{ij}\beta, \tag{4}$$

where $Y_{ij} = 0$ represents the favourable outcome, i.e. a low risk of
high blood pressure, whereas $Y_{ij} = 1$ corresponds to a high risk of
high blood pressure.

Given Eq. (2), the log-likelihood function can be written as

$$\ln L = \sum_{i=1}^{N}\sum_{j=1}^{J}\left[Y_{ij}\theta_{ij} - a(\theta_{ij})\right]. \tag{5}$$

To determine the regression parameters $\beta$ we differentiate the
log-likelihood with respect to $\beta$, which gives the following equation
for estimating the $\beta$'s:

$$\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{N}\sum_{j=1}^{J} X_{ij}\,(Y_{ij} - \mu_{ij}) = 0. \tag{6}$$

This is the traditional approach to determining the regression parameters,
which help us evaluate the factors that contribute towards high or low
health risks given the data. Equation (6) was derived assuming that the
outcome $Y_{ij}$ is a binary random variable; however, a similar equation
can be derived when the outcome is of any other type.
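To make the estimating equation concrete, the following is a minimal Python sketch, not part of ENTRA. It assumes the canonical logit link $\theta_{ij} = X_{ij}\beta$ with $\mu_{ij}$ the mean of $Y_{ij}$, pools all (i, j) pairs as independent observations, and uses plain gradient ascent. The function names and the toy data are hypothetical.

```python
import math

def log_likelihood(beta, X, Y):
    """Pooled log-likelihood: sum of y*theta - log(1 + exp(theta)), theta = x . beta."""
    total = 0.0
    for x, y in zip(X, Y):
        theta = sum(b * xk for b, xk in zip(beta, x))
        total += y * theta - math.log1p(math.exp(theta))
    return total

def score(beta, X, Y):
    """Gradient of the log-likelihood: sum of x * (y - mu), mu = 1 / (1 + exp(-theta))."""
    grad = [0.0] * len(beta)
    for x, y in zip(X, Y):
        theta = sum(b * xk for b, xk in zip(beta, x))
        mu = 1.0 / (1.0 + math.exp(-theta))
        for k, xk in enumerate(x):
            grad[k] += xk * (y - mu)
    return grad

def fit(X, Y, steps=2000, rate=0.5):
    """Gradient ascent on the averaged log-likelihood until the score is close to zero."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        g = score(beta, X, Y)
        beta = [b + rate * gk / len(X) for b, gk in zip(beta, g)]
    return beta

# Toy data (assumption): an intercept column plus one covariate, outcomes pooled over i and j.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 0.5], [1.0, 2.5]]
Y = [0, 0, 1, 1, 1, 0]
beta_hat = fit(X, Y)
```

At the returned `beta_hat` the score is numerically zero, which is the sense in which the estimating equation is solved. Standard errors are deliberately not computed here, since the independence working model does not give valid variances.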

In this approach it is assumed that all the $Y_{ij}$'s are independent
observations. This assumption, although not correct, yields estimates of
$\beta$ that are valid, but their variances are not. However, valid
standard errors can be obtained using techniques such as the empirical
variance estimator [7].

2.1 A novel approach 

Having reviewed the traditional approach we now present our idea. As
remarked earlier, the joint distribution of the N subjects in the
J dimensional space is not specified, as it is a priori unknown. However, if
the unknown distribution is transformed to a J dimensional Gaussian then a
valid analysis tool can be developed. We have developed an algorithm and a
program which perform this transformation in an entropy conserving way,
i.e. the Shannon entropy is preserved during the transformation.

Preserving the Shannon entropy requires transforming the high
dimensional vectors in a basis set consisting of divergence-free vector
fields, since such a basis represents a volume preserving map and thereby
preserves the Shannon entropy. A construction of such an orthocomplete
basis set is available in any dimension [3]. We use this mathematical
construct to transform the N vectors of dimension J. To determine the basis
coefficients such that the J dimensional density is transformed to a
Gaussian, the Karplus theorem discussed in Section 4 is invoked. Once the
coefficients are determined, the J dimensional joint distribution of the
subjects is a Gaussian having the same Shannon entropy as the starting
distribution. This has been implemented in ENTRA.

Once the joint distribution for the N subjects is known, a log-likelihood
function can be written for that distribution and the regression parameters
determined from it. In the next section we show the construction of the
divergence-free vector fields used in our program ENTRA.


3 J dimensional divergence-free vector fields

A mathematically rigorous construction of smooth divergence-free vector
fields in any dimension was provided in [3]. We use it to construct a basis
set consisting of J dimensional divergence-free vector fields. To do so we
define the following $J \times J$ matrix valued operator

$$\hat{O} = -I_{J\times J}\nabla^2 + \nabla\nabla^T. \tag{7}$$

Here the first term is the Laplacian operator multiplied by the
$J \times J$ identity matrix $I$, and the second term is the outer product
of the column and row vectors of the gradient operator in J dimensions.

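This operator acts on a smooth scalar function which we construct from
a J dimensional vector x as

$$\phi_\sigma(\|x - x_l\|) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\|x - x_l\|^2/(2\sigma^2)}, \tag{8}$$

where $\|\cdot\|$ is the Euclidean distance between two J dimensional
vectors and $\sigma$ is chosen to be $0.7 \times \Delta x$, where $\Delta x$
is the spacing between the basis vectors. The vector $x_l$ is a constant
centre, with the index l running from $-L$ to $L$.

Now we define a matrix valued function by applying the operator in
Eq. (7) to the scalar field in Eq. (8) as

$$\Phi_\sigma(x)_{J\times J} = \left\{-I_{J\times J}\nabla^2 + \nabla\nabla^T\right\}\phi_\sigma(\|x - x_l\|) = \frac{1}{\sqrt{2\pi\sigma^2}}\left[\frac{J-1}{\sigma^2} - \frac{\|x - x_l\|^2}{\sigma^4}\right] I_{J\times J}\, e^{-\|x - x_l\|^2/(2\sigma^2)} + \frac{1}{\sqrt{2\pi\sigma^2}}\,\frac{1}{\sigma^4}\,(x - x_l)(x - x_l)^T e^{-\|x - x_l\|^2/(2\sigma^2)}. \tag{9}$$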

It was proven rigorously that the columns of the above matrix are
divergence-free vector fields. For a given choice of centre we therefore
obtain a $J \times J$ dimensional vector field. From the results in [3], [9],
each column $v^i$ satisfies $\nabla \cdot v^i = 0$.


For a given centre $x_l$ there are J mutually orthogonal basis vectors,
each of dimension J. Hence for each centre we have a complete basis
set. By construction, the basis vectors enforce the divergence-free
condition strictly. Each vector has a unique coefficient $c_k$ attached to
it, $1 \le k \le J$.
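The divergence-free property can also be checked numerically. The following is a minimal Python sketch, not taken from ENTRA, that evaluates the matrix obtained by applying $-I\nabla^2 + \nabla\nabla^T$ to a Gaussian $e^{-\|x - x_l\|^2/(2\sigma^2)}$ (any constant prefactor is omitted, since it does not affect the divergence) and estimates the divergence of each column by central differences. The function names, the test point and the step size are assumptions chosen for illustration.

```python
import math

def phi_matrix(x, centre, sigma):
    """J x J matrix [-I * Laplacian + grad grad^T] applied to exp(-|x - c|^2 / (2 sigma^2))."""
    J = len(x)
    d = [xi - ci for xi, ci in zip(x, centre)]
    r2 = sum(di * di for di in d)
    g = math.exp(-r2 / (2.0 * sigma ** 2))
    diag = (J - 1) / sigma ** 2 - r2 / sigma ** 4
    return [[((diag if a == b else 0.0) + d[a] * d[b] / sigma ** 4) * g
             for b in range(J)] for a in range(J)]

def divergence_of_column(x, centre, sigma, col, h=1e-5):
    """Central-difference estimate of sum over a of d/dx_a of entry [a][col]."""
    total = 0.0
    for a in range(len(x)):
        xp, xm = list(x), list(x)
        xp[a] += h
        xm[a] -= h
        total += (phi_matrix(xp, centre, sigma)[a][col]
                  - phi_matrix(xm, centre, sigma)[a][col]) / (2.0 * h)
    return total

# Example in J = 4 dimensions with a single centre at the origin.
point = [0.3, -0.2, 0.5, 0.1]
centre = [0.0, 0.0, 0.0, 0.0]
divergences = [divergence_of_column(point, centre, 0.7, col) for col in range(4)]
```

Every entry of `divergences` vanishes up to finite-difference error, for any choice of point, centre and width.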

In the next section we demonstrate this for a simple 2D case.

Figure 1: Plot of the vector fields $v^1$ and $v^2$, respectively.

3.1 2D case 

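To demonstrate what the vector field looks like we take a simple 2D case,
with the centre at the origin, and define the scalar function of Eq. (8) as

$$\phi_\sigma(\|x - 0\|) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x^2+y^2)/(2\sigma^2)}. \tag{10}$$

The matrix valued function $\Phi_\sigma(x)_{2\times 2}$ becomes

$$\Phi_\sigma(x)_{2\times 2} = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x^2+y^2)/(2\sigma^2)} \begin{pmatrix} -\dfrac{y^2}{\sigma^4} + \dfrac{1}{\sigma^2} & \dfrac{xy}{\sigma^4} \\ \dfrac{xy}{\sigma^4} & -\dfrac{x^2}{\sigma^4} + \dfrac{1}{\sigma^2} \end{pmatrix}. \tag{11}$$

The vectors $v^1$ and $v^2$, defined from the columns of the above matrix as
below, are then divergence free:

$$v^1 = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x^2+y^2)/(2\sigma^2)} \begin{pmatrix} -\dfrac{y^2}{\sigma^4} + \dfrac{1}{\sigma^2} \\ \dfrac{xy}{\sigma^4} \end{pmatrix}, \tag{12}$$

$$v^2 = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x^2+y^2)/(2\sigma^2)} \begin{pmatrix} \dfrac{xy}{\sigma^4} \\ -\dfrac{x^2}{\sigma^4} + \dfrac{1}{\sigma^2} \end{pmatrix}. \tag{13}$$

Any linear combination of these divergence-free fields is also divergence
free. We plot these fields in Fig. 1 for $c_1 = c_2 = 1$ and $\sigma = 0.35$.

From the above plot we can see that in the 2D case we have two mutually
orthogonal divergence-free basis vectors, which constitute the complete
basis set in two dimensions. The result holds in general for any dimension.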
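The divergence-free property can be verified by hand in the 2D case. As a short worked check, take the first column of the matrix valued function with the constant prefactor dropped (it does not affect the divergence) and write $g = e^{-(x^2+y^2)/(2\sigma^2)}$. The symbols $g$, $v^1_x$ and $v^1_y$ are shorthand introduced here for illustration only.

```latex
\begin{aligned}
v^1_x &= g\left(\frac{1}{\sigma^2} - \frac{y^2}{\sigma^4}\right), \qquad
v^1_y = g\,\frac{xy}{\sigma^4}, \\
\partial_x v^1_x &= -\frac{x}{\sigma^2}\, g\left(\frac{1}{\sigma^2} - \frac{y^2}{\sigma^4}\right)
  = g\left(-\frac{x}{\sigma^4} + \frac{x y^2}{\sigma^6}\right), \\
\partial_y v^1_y &= g\,\frac{x}{\sigma^4} - \frac{y}{\sigma^2}\, g\,\frac{xy}{\sigma^4}
  = g\left(\frac{x}{\sigma^4} - \frac{x y^2}{\sigma^6}\right), \\
\nabla \cdot v^1 &= \partial_x v^1_x + \partial_y v^1_y = 0 .
\end{aligned}
```

The same cancellation occurs for the second column with the roles of x and y interchanged.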

4 Vector Transformations 

In the J dimensional space of the outcome variable the N subjects are
represented by N points. These N points are arranged according to a certain
density $\rho$. The Karplus theorem states that, for a given covariance, the
Gaussian distribution maximizes the entropy. Mathematically this can be
written as

$$J[\rho] = S[\rho_G] - S[\rho] \ge 0. \tag{14}$$

Here $\rho_G$ is the Gaussian density.

The covariance matrix is defined as

$$C = \langle (x - \langle x \rangle)(x - \langle x \rangle)^T \rangle. \tag{15}$$

Here the symbol $\langle \cdot \rangle$ denotes the ensemble average over the
N subjects, so that C is a $J \times J$ dimensional matrix, and x denotes a
J dimensional vector, i.e. a point in the configuration space.

The equality sign in Eq. (14) holds only if the underlying density is a
Gaussian. Now we introduce a transformation $f$ that preserves the
entropy, i.e.

$$S[f(\rho)] = S[\rho]. \tag{16}$$

With this transformation we want to deform the density $\rho$ towards a
Gaussian density. This becomes the following minimization problem [8]:

$$f_{\min} = \min_{f \in G} J[f(\rho)]. \tag{17}$$

Here G is the group of all smooth entropy preserving transformations.
Since the transformation leaves the entropy $S[\rho]$ unchanged, minimizing
$J$ amounts to minimizing the entropy of the Gaussian density. The entropy
of a J dimensional Gaussian density,
$S[\rho_G] = \frac{1}{2}\ln\left[(2\pi e)^J \det C\right]$, increases
monotonically with the determinant of its covariance matrix, so to solve
the above minimization problem we have to minimize the determinant of the
covariance matrix of the Gaussian density. Since by the Karplus theorem the
covariance of $\rho_G$ is the same as that of $\rho$, we can use the
covariance in Eq. (15) and write the above minimization problem as

$$f_{\min} = \min_{f \in G} \det C[f(\rho)]. \tag{18}$$

As a consequence of the above corollary, if we have a basis set consisting
of divergence-free vector fields under which the J dimensional vectors for
the N subjects are transformed, then the basis coefficients can be
determined by minimizing the determinant of the covariance matrix of the
transformed vectors with respect to the basis coefficients.
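The quantities entering this minimization are straightforward to compute from a sample. The following is a minimal Python sketch, not ENTRA code, that computes the ensemble covariance matrix, its determinant, and the entropy of the Gaussian with that covariance using the standard formula $\frac{1}{2}\ln[(2\pi e)^J \det C]$ (in nats). The function names and the four-point example are assumptions for illustration.

```python
import math

def covariance(points):
    """Ensemble covariance C = <(x - <x>)(x - <x>)^T> over a list of J-vectors."""
    n, J = len(points), len(points[0])
    mean = [sum(p[a] for p in points) / n for a in range(J)]
    return [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
             for b in range(J)] for a in range(J)]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size, det = len(m), 1.0
    for k in range(size):
        pivot = max(range(k, size), key=lambda r: abs(m[r][k]))
        if m[pivot][k] == 0.0:
            return 0.0
        if pivot != k:
            m[k], m[pivot] = m[pivot], m[k]
            det = -det
        det *= m[k][k]
        for r in range(k + 1, size):
            factor = m[r][k] / m[k][k]
            for col in range(k, size):
                m[r][col] -= factor * m[k][col]
    return det

def gaussian_entropy(cov):
    """Entropy (nats) of a Gaussian with covariance cov: 0.5 * ln((2 pi e)^J det C)."""
    J = len(cov)
    return 0.5 * (J * math.log(2.0 * math.pi * math.e) + math.log(determinant(cov)))

# Four points whose covariance is the 2 x 2 identity matrix.
points = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
C = covariance(points)
S_gauss = gaussian_entropy(C)
```

Because the logarithm is monotonic, lowering `determinant(C)` lowers `gaussian_entropy(C)`, which is why the determinant can serve as the objective.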

In the next section we describe how the basis set of smooth
divergence-free vector fields constructed in Section 3 is used in
the program package ENTRA, which I have developed.

5 Algorithm and program structure 

Given N subjects, the outcome variable for each subject is a J dimensional
vector, Eq. (1). We compute the covariance matrix of Eq. (15), which is a
$J \times J$ dimensional matrix. The matrix is then diagonalized by an
orthogonal transformation T, so that the covariance matrix C can be
written as

$$C = T \Lambda T^T, \tag{19}$$

with $\Lambda$ being the eigenvalue matrix and the columns of T being the
eigenvectors of C. We construct a $J \times N$ dimensional array by
appending all N of the J dimensional vectors and label it the trajectory Y.
To transform the trajectory to principal coordinates, in which the mean of
the trajectory is centred at 0, we project the trajectory onto the
eigenvectors to get the mean-centred trajectory

$$Y' = T\,(Y - \langle Y \rangle). \tag{20}$$

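For the trajectory $Y'$ we choose two J dimensional vectors $Y'_k$ and
$Y'_m$ and transform them in the vector field shown before as

$$V_k = Y'_k + \sum_{l}^{L}\sum_{i}^{J} c_{li}\, v^{i}_{l}(Y'_k), \tag{21}$$

$$V_m = Y'_m + \sum_{l}^{L}\sum_{i}^{J} c_{li}\, v^{i}_{l}(Y'_m). \tag{22}$$

Here $v^{i}_{l}$ denotes the i-th divergence-free basis vector attached to
the centre $x_l$ and $c_{li}$ is its coefficient. The coefficients are
chosen by minimizing the determinant of the covariance matrix C with
respect to the basis coefficients $c_{li}$. The minimization is performed
via the conjugate gradient method [10].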

The process is repeated for all N of the J dimensional vectors. In the
end this yields a trajectory V which has the smallest determinant of the
covariance matrix and whose underlying density has the same Shannon entropy
as that of our starting system.
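The steps above can be sketched in a few lines for the 2D case. The following Python sketch is an illustration under stated assumptions, not the ENTRA implementation: it uses a single centre at the origin, the two 2D basis fields with the constant prefactor omitted, and a simple finite-difference gradient descent with step halving in place of the conjugate gradient method. All function names and the random test data are hypothetical.

```python
import math
import random

def basis_fields(p, sigma):
    """Columns v1, v2 of the 2D matrix valued function centred at the origin."""
    x, y = p
    g = math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
    s2, s4 = sigma ** 2, sigma ** 4
    v1 = (g * (1.0 / s2 - y * y / s4), g * x * y / s4)
    v2 = (g * x * y / s4, g * (1.0 / s2 - x * x / s4))
    return v1, v2

def transform(points, c, sigma):
    """Shift every point by c[0] * v1 + c[1] * v2 evaluated at that point."""
    out = []
    for p in points:
        v1, v2 = basis_fields(p, sigma)
        out.append((p[0] + c[0] * v1[0] + c[1] * v2[0],
                    p[1] + c[0] * v1[1] + c[1] * v2[1]))
    return out

def det_cov(points):
    """Determinant of the 2 x 2 ensemble covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return cxx * cyy - cxy * cxy

def minimise(points, sigma, iters=50, h=1e-6):
    """Gradient descent on det(C) over the coefficients; a step is kept only if it helps."""
    c = [0.0, 0.0]
    best = det_cov(transform(points, c, sigma))
    step = 0.1
    for _ in range(iters):
        grad = []
        for k in range(2):
            cp = list(c)
            cp[k] += h
            grad.append((det_cov(transform(points, cp, sigma)) - best) / h)
        trial = [c[k] - step * grad[k] for k in range(2)]
        value = det_cov(transform(points, trial, sigma))
        if value < best:
            c, best = trial, value
        else:
            step *= 0.5
    return c, best

random.seed(0)
pts = [(random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)) for _ in range(200)]
coeffs, det_after = minimise(pts, 0.5)
det_before = det_cov(pts)
```

Because a trial step is accepted only when it lowers the objective, `det_after` can never exceed `det_before`. A production code would use many centres and a proper conjugate gradient routine.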


5.1 Program input 

The classes that build up the core of the program are ENTRA, trajectory
and grid. Objects of class trajectory represent real-valued arrays. The
arrays are stored as high dimensional vectors and the dimensionality of the
vectors is defined by the grid class. The grid class also defines the number
and spacing of the basis vectors. The methods provided by the trajectory and
grid classes allow the user to initialize the corresponding arrays, to
manipulate them, and to store them to (load them from) files. To start
using the program a few parameters have to be provided:
long nsources = J;
long nsubjects = N;
double deltx = spacing between basis functions;
long ngpsx = number of basis functions = L;
long Max_iter = maximum number of iterations to find the basis coefficients;


6 Test Calculations 

The goal of this example is to generate a trajectory with a random
underlying density and then to transform it towards a trajectory with a
Gaussian underlying density. To generate a random trajectory we use the
Random utility of Eigen, which generates a random matrix of any dimension.
The parameters chosen for this test are as follows:
long nsources=10 ; 
long nsubjects=1000; 
double deltx =0.05; 
long ngpsx=80; 
long Max_iter=500; 
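For readers who want to script around these settings, the parameter set can be mirrored in a small container. This is a hypothetical Python sketch, not part of ENTRA: the class name `EntraParams` and the `sigma` property are assumptions, with the Gaussian width taken as 0.7 times the spacing between basis functions, as stated in Section 3.

```python
from dataclasses import dataclass

@dataclass
class EntraParams:
    """Hypothetical container mirroring the ENTRA input parameters."""
    nsources: int    # J, number of data sources or time points
    nsubjects: int   # N, number of subjects
    deltx: float     # spacing between basis functions
    ngpsx: int       # number of basis functions, L
    max_iter: int    # maximum number of iterations for the coefficient search

    @property
    def sigma(self) -> float:
        """Gaussian width, chosen as 0.7 times the spacing between basis functions."""
        return 0.7 * self.deltx

# Values used in the test calculation.
params = EntraParams(nsources=10, nsubjects=1000, deltx=0.05, ngpsx=80, max_iter=500)
```

With the test values above, `params.sigma` evaluates to approximately 0.035.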

Results of the transformation (after a single iteration cycle) are shown
in Fig. 2. In this example we transform a 30 dimensional vector having
1000 configurations. The data lie along 30 axes, labelled
$(X_1, X_2, \dots, X_{30})$. Here $X_p = Y_{ip}$, i.e. the p-th component of
the J dimensional vector. We plot two dimensional subspaces of the original
30 dimensional space along different axes, as labelled in Fig. 2.

To check that the transformation is indeed entropy conserving, we
computed the subspace entropy via histogramming for the 2D subspaces shown
in Fig. 2, for both the original and the transformed subspaces.

The Matlab code file used to compute the entropy of the subspaces, along
with its output, is provided here:

Matlab file to estimate the entropy via histogramming for the plot in Fig. 2a

% Estimate the entropy of the original and transformed subspaces via histogramming
load simulation_examplehist.dat
defaultn = 500;                          % number of histogram bins
n = defaultn;

% Original data: columns 8, 9, 11 and 12
X1 = simulation_examplehist(:, 8);
X2 = simulation_examplehist(:, 9);
X3 = simulation_examplehist(:, 11);
X4 = simulation_examplehist(:, 12);
X11 = [X1; X3]; X12 = [X2; X4];
X = [X11, X12];
plotmatrix(X);
X = double(X); Xh = hist(X(:), n); Xh = Xh / sum(Xh(:));   % normalised bin counts
i = find(Xh);                            % keep non-empty bins only
h = -sum(Xh(i) .* log2(Xh(i)));          % entropy in bits
InitialEntropy = h;
display(InitialEntropy);

% Transformed data: columns 2, 3, 5 and 6
Y1 = simulation_examplehist(:, 2);
Y2 = simulation_examplehist(:, 3);
Y3 = simulation_examplehist(:, 5);
Y4 = simulation_examplehist(:, 6);
Y11 = [Y1; Y3]; Y12 = [Y2; Y4];
Y = [Y11, Y12];
plotmatrix(Y);
Y = double(Y); Yh = hist(Y(:), n); Yh = Yh / sum(Yh(:));   % normalised bin counts
i = find(Yh);                            % keep non-empty bins only
htwo = -sum(Yh(i) .* log2(Yh(i)));       % entropy in bits
TransformedEntropy = htwo;
display(TransformedEntropy);
EntropyDifference = InitialEntropy - TransformedEntropy;
display(EntropyDifference);
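For readers without Matlab, the histogram estimate used above can be sketched in Python. This is a minimal sketch under the assumption that the data are flattened into one list and sorted into equal-width bins spanning the minimum to the maximum value, which is how the Matlab listing uses `hist`. The function name `histogram_entropy` is hypothetical.

```python
import math

def histogram_entropy(values, n_bins=500):
    """Entropy in bits of the normalised equal-width histogram of values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all values identical: a single occupied bin
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)  # put the maximum in the last bin
        counts[k] += 1
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Four values spread over four bins give a uniform histogram.
example = histogram_entropy([0.0, 1.0, 2.0, 3.0], n_bins=4)
```

A histogram with n bins can never yield more than log2(n) bits, so with 500 bins the estimate is bounded above by about 8.97 bits regardless of the data.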

Figure 2: 2D histogram plots for the original (left) and transformed (right)
subspace.

Result of the above file

>> run simpleentropy
InitialEntropy =

    8.7636

TransformedEntropy =

    7.5625

EntropyDifference =

    1.2011

Result for the plot in Fig. 2b

>> run simpleentropy 
InitialEntropy = 

8.7770 

TransformedEntropy = 

7.6632 

EntropyDifference = 

1.1138 

Result for the plot in Fig. 2c

>> run simpleentropy 
InitialEntropy = 

8.7754 

TransformedEntropy = 

7.6483 

EntropyDifference = 

1.1271 

Two dimensional histogram plots for the initial and transformed
configurations are shown in Fig. 2.

We can also compute the entropy in higher dimensions via histogramming.
As the data are 30 dimensional, we now look at the data along the first
three axes $(X_1, X_2, X_3)$, as seen in Fig. 3. In this figure we plot the
data along one axis versus another, as labelled, together with the
histogram along each axis. The same plot for the transformed data is shown
in Fig. 4. We compute the entropy of the three dimensional data using a
Matlab code similar to the one above; its results are shown here:

Result for computation of entropy for 3D data 
>> run highdimentropy 
InitialEntropy =

    8.8220

TransformedEntropy =

    7.5756

EntropyDifference =

1.2464 

These results indicate that in just a single iteration cycle the unknown
configuration space density is transformed towards a Gaussian density with
the entropy conserved only approximately: the histogram estimates before
and after the transformation differ by about 1.1 to 1.25 bits. Clearly 1000
points are not sufficient to get an accurate entropy estimate and more
statistics are required. This will form part of the work to be done.






Figure 3: 3D histogram plots for the original subspace along the three axes
$(X_1, X_2, X_3)$. More details in the text.

Figure 4: 3D scatter and histogram plots for the original subspace along
the three axes $(X_1, X_2, X_3)$. More details in the text.


7 Summary and Outlook


The ENTRA package has been presented. The aim of this package is to
transform high dimensional data sets such as those found in epidemiology.
With this transformation the underlying high dimensional density function
is transformed to a high dimensional Gaussian, and owing to the convenient
properties of the Gaussian distribution further data analysis
can be accomplished more easily than before.

The following major points need to be addressed and will form the
main body of the work to be done at the institute:

1) The appropriate choice of the basis set vectors. The number of basis
vectors depends on the data and needs to be estimated beforehand.
How exactly that can be achieved needs to be determined.

2) Building an example with enough statistics to show that
entropy conservation is maintained to a high degree of accuracy.

3) Developing full file support so that epidemiologists can load their
data and obtain the transformed data.

4) Developing a complete regression analysis, within the ENTRA ecosystem,
to estimate conditional probabilities such as those in Equation (4).